The Sum-to-Zero Constraint in Stan

Author

Mitzi Morris

Show Code
# libraries used in this notebook
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import plotnine as p9
import libpysal
from splot.libpysal import plot_spatial_weights 
from random import randint

from cmdstanpy import CmdStanModel
import logging
cmdstanpy_logger = logging.getLogger("cmdstanpy")
cmdstanpy_logger.setLevel(logging.ERROR)

import warnings
warnings.filterwarnings('ignore')

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# notebook display options
plt.rcParams['figure.figsize'] = (7, 7)
plt.rcParams['figure.dpi'] = 100

np.set_printoptions(precision=2)
np.set_printoptions(suppress=True)
pd.set_option('display.precision', 2)
pd.options.display.float_format = '{:.2f}'.format


# helper functions

def extract_numeric_index(idx: pd.Index) -> pd.Series:
    return idx.str.extract(r'[a-z_]*\[(\d+)\]', expand=False).astype(int)

# add dividers every nth row
from pandas.io.formats.style import Styler
def style_dataframe(df: pd.DataFrame, modulus: int) -> Styler:
    def highlight_every_nth_row(row: pd.Series, row_index: int, modulus: int) -> list[str]:
        if (row_index + 1) % modulus == 0:  # add border
            return ['border-bottom: 3px double black'] * len(row)
        return [''] * len(row)
    return (df.style
              .apply(lambda row: highlight_every_nth_row(row, df.index.get_loc(row.name), modulus), axis=1)
              .format(precision=2)
           )


# Filter and sort predictors
def summarize_predictor(df: pd.DataFrame, name: str) -> pd.DataFrame:
    pred_summary = df.filter(regex=name, axis=0).sort_index()
    if "[" in name:
        pred_summary = pred_summary.sort_index(key=extract_numeric_index)
    
    return pred_summary[['Mean', 'StdDev', 'ESS_bulk', 'ESS/sec', 'R_hat']]


# side-by-side tables
from IPython.display import display, HTML
def display_side_by_side(
    html_left: str,
    html_right: str,
    title_left: str = "Small Dataset",
    title_right: str = "Large Dataset"
) -> None:
    """
    Displays two HTML tables side by side in a Jupyter Notebook.
    """
    html_code = f"""
    <div style="display: flex; justify-content: space-between; gap: 10px;">
        <div style="width: 48%; border: 1px solid #ddd; padding: 5px;">
            <b><i>{title_left}</i></b>
            {html_left}
        </div>
        <div style="width: 48%; border: 1px solid #ddd; padding: 5px;">
            <b><i>{title_right}</i></b>
            {html_right}
        </div>
    </div>
    """
    display(HTML(html_code))

Introducing the sum_to_zero_vector Constrained Parameter Type

The sum_to_zero_vector constrained parameter type was introduced in the Stan 2.36 release.

The parameter declaration:

  sum_to_zero_vector[K] beta;

produces a vector of size K such that sum(beta) = 0. The unconstrained representation requires only K - 1 free values, since the constraint removes one degree of freedom.

Further discussion can be found in this post on the Stan Discourse forums.

A sum-to-zero vector is exactly what the name suggests: a vector whose elements sum to 0. If you put a normal prior on the zero-sum vector, the resulting marginal variance will be less than the intended normal variance. To get the same variance as the intended normal prior, scale the prior standard deviation:

parameters {
  sum_to_zero_vector[N] z;
}
model {
  z ~ normal(0, sqrt(N * inv(N - 1)) * sigma);
}

where sigma is the intended standard deviation. It is slightly more efficient to pre-compute sqrt(N * inv(N - 1)) in the transformed data block. The general result for obtaining a given variance from a normal with linear constraints is in: Fraser, D. A. S. (1951). Normal Samples With Linear Constraints and Given Variances. Canadian Journal of Mathematics, 3, 363–366. doi:10.4153/CJM-1951-041-9.
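This scaling factor can be checked numerically. The sketch below relies on the fact that, for Gaussians, a normal(0, scale) prior on a sum-to-zero vector is equivalent to i.i.d. normal(0, scale) draws projected onto the sum-zero subspace, i.e., centered by their mean; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 5, 2.0
scale = np.sqrt(N / (N - 1)) * sigma        # adjusted prior scale

# iid normal draws, centered so that each row sums to zero
y = rng.normal(0.0, scale, size=(200_000, N))
z = y - y.mean(axis=1, keepdims=True)

print(np.abs(z.sum(axis=1)).max())          # ~0, up to float error
print(z.var(axis=0))                        # each element's variance ~ sigma**2
```

With the adjusted scale, each element of the zero-sum vector recovers the intended marginal variance \(\sigma^2\).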

Prior to Stan 2.36, a sum-to-zero constraint could be implemented in one of two ways:

  • As a “hard” sum-to-zero constraint, where the parameter is declared as an \(N-1\) length vector, with a corresponding \(N\)-length transformed parameter whose first \(N-1\) elements match the parameter vector and whose \(N^{th}\) element is the negative sum of the first \(N-1\) elements.

  • As a “soft” sum-to-zero constraint, with an \(N\)-length parameter vector whose sum is constrained to be within \(\epsilon\) of \(0\).
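For reference, the two pre-2.36 implementations look roughly like this (illustrative fragments, not a complete program; the names x, x_raw, and epsilon are placeholders):

```stan
// Hard constraint: N - 1 free parameters; the Nth element
// is defined as the negative sum of the first N - 1.
parameters {
  vector[N - 1] x_raw;
}
transformed parameters {
  vector[N] x = append_row(x_raw, -sum(x_raw));
}

// Soft constraint: all N elements are free parameters, and the
// sum is penalized toward zero with analyst-chosen tolerance epsilon.
parameters {
  vector[N] x;
}
model {
  sum(x) ~ normal(0, epsilon);
}
```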

Up until now, users had to choose between the hard and soft sum-to-zero constraints, without clear guidance. As a general rule, the hard constraint is more efficient for small vectors and the soft constraint is faster for larger ones, but much depends on the specifics of the model and the data.

For small \(N\) and models with sensible priors, the hard sum-to-zero constraint is usually satisfactory. But as the size of the vector grows, it distorts the marginal variance of the \(N^{th}\) element. Given a parameter vector \[ x_1, x_2, \dots, x_{N-1} \sim \text{i.i.d. } N(0, \sigma^2), \] by the properties of independent normal variables, each of the free elements \(x_1, \ldots, x_{N-1}\) has variance \(\sigma^2\). However, the \(N^{th}\) element is defined deterministically as \[ x_N = -\sum_{i=1}^{N-1} x_i \] and its variance is inflated by a factor of \(N-1\): \[ \operatorname{Var}(x_N) = \operatorname{Var}\Bigl(-\sum_{i=1}^{N-1} x_i\Bigr) = \sum_{i=1}^{N-1} \operatorname{Var}(x_i) = (N-1)\sigma^2. \] For large vectors, MCMC samplers struggle with the hard sum-to-zero constraint: every change to any of the \(N-1\) free elements requires a corresponding change to the \(N^{th}\) element, and balancing these changes introduces potential non-identifiabilities.
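The variance inflation is easy to verify numerically. Here is a minimal NumPy sketch of the hard constraint; the sample sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 10, 1.0

# Hard sum-to-zero constraint: N - 1 free iid normal draws,
# with the Nth element defined as the negative sum of the others.
x_free = rng.normal(0.0, sigma, size=(200_000, N - 1))
x_last = -x_free.sum(axis=1)

print(x_free.var(axis=0))   # each ~ sigma**2 == 1
print(x_last.var())         # ~ (N - 1) * sigma**2 == 9
```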

The soft sum-to-zero constraint is problematic for the following reasons.

  • The tolerance \(\epsilon\) (the scale of the penalty) must be chosen by the analyst. Too large, and the result is too far from zero to be effective; too small, and the sampler cannot satisfy the constraint.
  • The soft constraint only penalizes deviations from zero, leading to weaker identifiability of the parameters. This can lead to slow convergence and mixing, as the sampler explores nearly non-identified regions.
  • The marginal variances may not reflect the intended prior.

The sum_to_zero_vector transform ensures that each element of the resulting constrained vector has the same marginal variance. This improves sampler performance, providing fast computation and good effective sample sizes, and the improvement becomes increasingly noticeable as models grow in size and complexity. To demonstrate this, in this notebook we consider two different classes of models:

  • Multi-level regressions for binomial data with group-level categorical predictors.
  • Spatial models for areal data.
Note

The spatial models are taken from a set of notebooks available from the GitHub repo https://github.com/mitzimorris/geomed_2024.

For these models, we provide three implementations which differ only in the implementation of the sum-to-zero constraint: the built-in sum_to_zero_vector, and the hard and soft sum-to-zero implementations. We fit each model to the same dataset, using the same random seed, and then compare the summary statistics for the constrained parameter values. Since the models are equivalent, we expect that all three implementations should produce the same estimates; what differs is the speed of computation, as measured by effective samples per second.

Multi-level Models with Group-level Categorical Predictors

In this section we consider a model which estimates per-demographic disease prevalence rates for a population. The model is taken from Gelman and Carpenter, 2020, Bayesian Analysis of Tests with Unknown Specificity and Sensitivity. It combines a model for multilevel regression and post-stratification with a likelihood that accounts for test sensitivity and specificity.

The data consists of:

  • A set of per-demographic aggregated outcomes of a diagnostic test procedure, with an unequal number of tests per demographic.

  • A corresponding set of demographic descriptors encoded as a vector of categorical values. In this example these are named sex, age, eth, and edu, but there can be any number of demographic predictors with any number of categories.

  • The specified test sensitivity and specificity.

In order to fit this model, we need to put a sum-to-zero constraint on the categorical variables.

The Stan model

The full model is in file stan/binomial_4preds_ozs.stan. It provides an estimate of the true prevalence based on binary tests with a given (or unknown) test sensitivity and specificity as follows.

transformed parameters {
  // true prevalence
  vector[N] p = inv_logit(beta_0 + beta_sex * sex_c + beta_age[age]
              + beta_eth[eth] + beta_edu[edu]);
  // incorporate test sensitivity and specificity.
  vector[N] p_sample = p * sens + (1 - p) * (1 - spec);
}
model {
  pos_tests ~ binomial(tests, p_sample);  // likelihood
  ...

To constrain the group-level parameters age, eth, and edu, we use the sum_to_zero_vector.

parameters {
  real beta_0;
  real beta_sex;
  real<lower=0> sigma_age, sigma_eth, sigma_edu;
  sum_to_zero_vector[N_age] beta_age;
  sum_to_zero_vector[N_eth] beta_eth;
  sum_to_zero_vector[N_edu] beta_edu;
}

In order to put a standard normal prior on beta_age, beta_eth, and beta_edu, we need to scale the prior standard deviation, as described above. The scaling factors are pre-computed in the transformed data block, and applied as part of the prior.

transformed data {
  // scaling factors for marginal variances of sum_to_zero_vectors
  real s_age = sqrt(N_age * inv(N_age - 1));
  real s_eth = sqrt(N_eth * inv(N_eth - 1));
  real s_edu = sqrt(N_edu * inv(N_edu - 1));
}
  ...
model {
  ...
  // centered parameterization
  // scale normal priors on sum_to_zero_vectors
  beta_age ~ normal(0, s_age * sigma_age);
  beta_eth ~ normal(0, s_eth * sigma_eth);
  beta_edu ~ normal(0, s_edu * sigma_edu);
}

The data-generating program

To investigate the predictive behavior of this model at different timepoints in a pandemic, we have written a data-generating program to create datasets given the baseline disease prevalence, the test specificity and sensitivity, and the specified total number of diagnostic tests.

In the generated quantities block we use Stan’s PRNG functions to populate the true weights for the categorical coefficient vectors, and the relative percentages of per-category observations. Then we use a set of nested loops to generate the data for each demographic, using the PRNG equivalent of the model likelihood.

  • Because the modeled data pos_tests is generated according to the Stan model’s likelihood, the model is a priori well-specified with respect to the data.

  • Because the true parameters are defined in the generated quantities block, each sample provides a dataset drawn from a different set of regression covariates and with different amounts of per-demographic data.

The full data-generating program is in file stan/gen_binomial_4_preds.stan. Here we show the nested loop which generates the modeled and unmodeled data inputs.

transformed data {
  int strata = 2 * N_age * N_eth * N_edu;
}
generated quantities {
  ...
  // generate true parameters via PRNG functions
  ...
  array[strata] int sex, age, eth, edu, pos_tests, tests;
  array[strata] real p;
  array[strata] real p_sample;

  int idx = 1;
  for (i_sex in 1:2) {
    for (i_age in 1:N_age) {
      for (i_eth in 1:N_eth) {
        for (i_edu in 1:N_edu) {

      // corresponds to unmodeled data inputs
          sex[idx] = i_sex; age[idx] = i_age; eth[idx] = i_eth; edu[idx] = i_edu;
          tests[idx] = to_int(pct_sex[i_sex] * pct_age[i_age]
                          * pct_eth[i_eth] * pct_edu[i_edu] * N);

      // corresponds to transformed parameters
          p[idx] = inv_logit(beta_0 + beta_sex * (i_sex)
                    + beta_age[i_age] + beta_eth[i_eth] +  beta_edu[i_edu]);
          p_sample[idx] = p[idx] * sens + (1 - p[idx]) * (1 - spec);

      // corresponds to likelihood
          pos_tests[idx] = binomial_rng(tests[idx], p_sample[idx]);
          idx += 1;
        }}}}
Note

The set of nested for loops used here to generate the data is the same as would be used to post-stratify the fitted model predictions. See the section Coding MRP in Stan in the Stan User’s Guide.
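As a reminder of what post-stratification computes, here is a minimal sketch: per-stratum prevalence estimates are weighted by the known population share of each stratum. The prevalence values and shares below are made up for illustration.

```python
import numpy as np

# hypothetical per-stratum prevalence estimates and population shares
p_stratum = np.array([0.02, 0.05, 0.10])
pop_share = np.array([0.5, 0.3, 0.2])   # must sum to 1

# population-level prevalence: share-weighted average over strata
pop_prevalence = float(p_stratum @ pop_share)
print(pop_prevalence)                   # ~ 0.045
```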

Creating Simulated Datasets

The data-generating program allows us to create datasets for large and small populations and for finer- or coarser-grained sets of categories. The larger the overall number of strata, the more observations are needed to get good coverage.

Instantiate the data generating model.
Show Code
datagen_model_file = os.path.join('stan', 'gen_binomial_4_preds.stan')
gen_mod = CmdStanModel(stan_file=datagen_model_file)
Specify the number of categories for age, eth, and edu.
Show Code
gen_data_dict = {
    'N_eth':3, 'N_edu':5, 'N_age':9, 
    'baseline': -3.5, 'sens': 0.75, 'spec': 0.9995}

strata = 2 * gen_data_dict['N_age'] * gen_data_dict['N_eth'] * gen_data_dict['N_edu']
Specify the total number of observations.

We generate two datasets: one with a small number of observations, relative to the number of strata, and one with sufficient data to provide information on all combinations of demographics.

Show Code
gen_data = gen_data_dict.copy()
gen_data['N'] = strata * 17

gen_data_lg = gen_data_dict.copy()
gen_data_lg['N'] = strata * 200

gen_data_tiny = gen_data_dict.copy()
gen_data_tiny['N'] = strata * 7
Run 1 sampling iteration to get a complete dataset.
Show Code
sim_data = gen_mod.sample(data=gen_data,
                          iter_warmup=1, iter_sampling=1, chains=1, seed=45678)

sim_data_lg = gen_mod.sample(data=gen_data_lg,
                          iter_warmup=1, iter_sampling=1, chains=1, seed=45678)

sim_data_tiny = gen_mod.sample(data=gen_data_tiny,
                          iter_warmup=1, iter_sampling=1, chains=1, seed=45678)
12:47:35 - cmdstanpy - INFO - CmdStan start processing
                                                                                
12:47:35 - cmdstanpy - INFO - CmdStan done processing.
12:47:35 - cmdstanpy - INFO - CmdStan start processing
                                                                                
12:47:35 - cmdstanpy - INFO - CmdStan done processing.
12:47:35 - cmdstanpy - INFO - CmdStan start processing
                                                                                
12:47:35 - cmdstanpy - INFO - CmdStan done processing.
Examine the generated data-generating parameters and the resulting dataset.
Show Code
print(f'Small dataset: N = {gen_data["N"]}, strata = {strata}, expected obs per demographic {gen_data["N"] / strata}')
print(f'Large dataset: N = {gen_data_lg["N"]}, strata = {strata}, expected obs per demographic {gen_data_lg["N"] / strata}')
for var, value in sim_data.stan_variables().items():
    print(var, value[0]) if isinstance(value[0], np.float64) else print(var, value[0][:10])
Small dataset: N = 4590, strata = 270, expected obs per demographic 17.0
Large dataset: N = 54000, strata = 270, expected obs per demographic 200.0
beta_0 -3.5
beta_sex 0.591981
pct_sex [0.4 0.6]
pct_age [0.11 0.02 0.1  0.15 0.04 0.14 0.25 0.12 0.08]
beta_age [-0.74 -1.43 -1.67  1.   -0.52 -0.77  0.96  1.31  0.25]
pct_eth [0.13 0.64 0.22]
beta_eth [ 1.97 -0.8   0.75]
pct_edu [0.19 0.24 0.27 0.09 0.2 ]
beta_edu [ 0.24 -0.22  0.87 -0.85  1.5 ]
sex [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
age [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
eth [1. 1. 1. 1. 1. 2. 2. 2. 2. 2.]
edu [1. 2. 3. 4. 5. 1. 2. 3. 4. 5.]
pos_tests [0. 0. 2. 0. 2. 0. 0. 1. 0. 0.]
tests [ 5.  6.  7.  2.  5. 25. 32. 35. 12. 27.]
p [0.19 0.13 0.31 0.07 0.46 0.01 0.01 0.03 0.   0.05]
p_sample [0.14 0.1  0.23 0.06 0.34 0.01 0.01 0.02 0.   0.04]
idx 271.0

What is the distribution of the observed number of tests per demographic?

Show Code
tests = pd.Series(sim_data.stan_variable('tests')[0])
print("Small dataset, tests per demographic", tests.describe())
Small dataset, tests per demographic count   270.00
mean     16.49
std      18.51
min       0.00
25%       5.00
50%      10.00
75%      21.75
max     119.00
dtype: float64
Show Code
tests_lg = pd.Series(sim_data_lg.stan_variable('tests')[0])
print("Large dataset, tests per demographic", tests_lg.describe())
Large dataset, tests per demographic count    270.00
mean     199.49
std      218.00
min        4.00
25%       61.25
50%      123.00
75%      258.50
max     1401.00
dtype: float64
Show Code
tests_tiny = pd.Series(sim_data_tiny.stan_variable('tests')[0])
print("Tiny dataset, tests per demographic", tests_tiny.describe())
Tiny dataset, tests per demographic count   270.00
mean      6.48
std       7.63
min       0.00
25%       2.00
50%       4.00
75%       8.75
max      49.00
dtype: float64
Plot the distribution of observed positive tests and the underlying prevalence.

Because the data-generating parameters and the percentage of observations per category are generated at random, some datasets may have very low overall disease rates and/or many unobserved strata, and will therefore be pathologically hard to fit. This is a consequence of how the per-category percentages and regression weights are generated in the Stan data-generating program.

  vector[N_eth] pct_eth = dirichlet_rng(rep_vector(1, N_eth));
  for (n in 1:N_eth) {
    beta_eth[n] = std_normal_rng();
  }

However, this can result in very unbalanced datasets, in which case it is best to generate another dataset and continue.

Show Code
sim_df = pd.DataFrame({'tests':sim_data.tests[0], 'pos_tests':sim_data.pos_tests[0], 'p_sample':sim_data.p_sample[0]})
sim_df['raw_prev'] = sim_df['pos_tests'] / sim_df['tests']
(
    p9.ggplot(sim_df)
    + p9.geom_density(p9.aes(x='raw_prev'), color='darkblue', fill='blue', alpha=0.3)
    + p9.geom_density(p9.aes(x='p_sample'), color='darkorange', fill='pink', alpha=0.3)
    + p9.labs(
        x='raw prevalence',
        y='',
        title='Observed (blue) and underlying true prevalence (pink)\nsmall dataset'
    )
    + p9.theme_minimal()
)

Show Code
sim_df = pd.DataFrame({'tests':sim_data_lg.tests[0], 'pos_tests':sim_data_lg.pos_tests[0], 'p_sample':sim_data_lg.p_sample[0]})
sim_df['raw_prev'] = sim_df['pos_tests'] / sim_df['tests']
(
    p9.ggplot(sim_df)
    + p9.geom_density(p9.aes(x='raw_prev'), color='darkblue', fill='blue', alpha=0.3)
    + p9.geom_density(p9.aes(x='p_sample'), color='darkorange', fill='pink', alpha=0.3)
    + p9.labs(
        x='raw prevalence',
        y='',
        title='Observed (blue) and underlying true prevalence (pink)\nlarge dataset'
    )
    + p9.theme_minimal()
)

Model Fitting

Assemble the data dictionary of all input data for the model which solves the inverse problem - i.e., estimates the regression coefficients given the observed data. We use the generated data as the inputs. Because the output files store all values as reals, regardless of the declared variable type, model data variables of type int need to be cast back to int. Here all the observed data are counts and categorical values.

Show Code
data_fixed = {'N':sim_data.pos_tests.shape[1], 
              'N_age':gen_data_dict['N_age'], 
              'N_eth':gen_data_dict['N_eth'],
              'N_edu':gen_data_dict['N_edu'],
              'sens': gen_data_dict['sens'],
              'spec': gen_data_dict['spec'],
              'intercept_prior_mean': gen_data_dict['baseline'],
              'intercept_prior_scale': 2.5}

data_small = data_fixed | {'pos_tests':sim_data.pos_tests[0].astype(int),
             'tests':sim_data.tests[0].astype(int),
             'sex':sim_data.sex[0].astype(int),
             'age':sim_data.age[0].astype(int), 
             'eth':sim_data.eth[0].astype(int),
             'edu':sim_data.edu[0].astype(int)}

data_large = data_fixed | {'pos_tests':sim_data_lg.pos_tests[0].astype(int),
             'tests':sim_data_lg.tests[0].astype(int),
             'sex':sim_data_lg.sex[0].astype(int),
             'age':sim_data_lg.age[0].astype(int), 
             'eth':sim_data_lg.eth[0].astype(int),
             'edu':sim_data_lg.edu[0].astype(int)}

data_tiny = data_fixed | {'pos_tests':sim_data_tiny.pos_tests[0].astype(int),
             'tests':sim_data_tiny.tests[0].astype(int),
             'sex':sim_data_tiny.sex[0].astype(int),
             'age':sim_data_tiny.age[0].astype(int), 
             'eth':sim_data_tiny.eth[0].astype(int),
             'edu':sim_data_tiny.edu[0].astype(int)}

Record the data-generating parameters

Show Code
true_params = {
    'beta_0': sim_data.beta_0[0],
    'pct_sex': sim_data.pct_sex[0],
    'beta_sex': sim_data.beta_sex[0],
    'pct_age': sim_data.pct_age[0],
    'beta_age':sim_data.beta_age[0],
    'pct_eth': sim_data.pct_eth[0],
    'beta_eth':sim_data.beta_eth[0],
    'pct_edu': sim_data.pct_edu[0],
    'beta_edu':sim_data.beta_edu[0]
}
true_params
{'beta_0': -3.5,
 'pct_sex': array([0.4, 0.6]),
 'beta_sex': 0.591981,
 'pct_age': array([0.11, 0.02, 0.1 , 0.15, 0.04, 0.14, 0.25, 0.12, 0.08]),
 'beta_age': array([-0.74, -1.43, -1.67,  1.  , -0.52, -0.77,  0.96,  1.31,  0.25]),
 'pct_eth': array([0.13, 0.64, 0.22]),
 'beta_eth': array([ 1.97, -0.8 ,  0.75]),
 'pct_edu': array([0.19, 0.24, 0.27, 0.09, 0.2 ]),
 'beta_edu': array([ 0.24, -0.22,  0.87, -0.85,  1.5 ])}

Model 1: sum_to_zero_vector

Show Code
binomial_ozs_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_ozs.stan'))
Show Code
binomial_ozs_fit = binomial_ozs_mod.sample(data=data_small, parallel_chains=4)
12:47:36 - cmdstanpy - INFO - CmdStan start processing
                                                                                                                                                                                                                                                                                                                                
12:47:36 - cmdstanpy - INFO - CmdStan done processing.
12:47:36 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 55, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 56, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 55, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 55, column 2 to column 42)
Consider re-running with show_console=True if the above output is unclear!

Record the seed used for the first run and use it for all subsequent fits.

Show Code
a_seed = binomial_ozs_fit.metadata.cmdstan_config['seed']
Show Code
binomial_ozs_fit_lg = binomial_ozs_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)
12:47:37 - cmdstanpy - INFO - CmdStan start processing
                                                                                                                                                                                                                                                                                                                                
12:47:37 - cmdstanpy - INFO - CmdStan done processing.
12:47:37 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 56, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 55, column 2 to column 42)
Consider re-running with show_console=True if the above output is unclear!

Model 2: Hard sum-to-zero constraint

Run the sampler to get posterior estimates of the model conditioned on the data.

Show Code
binomial_hard_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_hard.stan'))
Show Code
binomial_hard_fit = binomial_hard_mod.sample(data=data_small, parallel_chains=4, seed=a_seed)
12:47:37 - cmdstanpy - INFO - CmdStan start processing
                                                                                                                                                                                                                                                                                                                                
12:47:38 - cmdstanpy - INFO - CmdStan done processing.
12:47:38 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 45, column 2 to column 38)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 44, column 2 to column 38)
Consider re-running with show_console=True if the above output is unclear!
Show Code
binomial_hard_fit_lg = binomial_hard_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)
12:47:38 - cmdstanpy - INFO - CmdStan start processing
                                                                                                                                                                                                                                                                                                                                
12:47:39 - cmdstanpy - INFO - CmdStan done processing.
12:47:39 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 45, column 2 to column 38)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 44, column 2 to column 38)
Consider re-running with show_console=True if the above output is unclear!

Model 3: soft sum-to-zero constraint

Show Code
binomial_soft_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_soft.stan'))
Show Code
binomial_soft_fit = binomial_soft_mod.sample(data=data_small, parallel_chains=4, seed=a_seed)
12:47:39 - cmdstanpy - INFO - CmdStan start processing
                                                                                                                                                                                                                                                                                                                                
12:47:52 - cmdstanpy - INFO - CmdStan done processing.
12:47:52 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_soft.stan', line 45, column 2 to column 34)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_soft.stan', line 44, column 2 to column 34)
Consider re-running with show_console=True if the above output is unclear!
Show Code
binomial_soft_fit_lg = binomial_soft_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)
12:47:52 - cmdstanpy - INFO - CmdStan start processing
                                                                                                                                                                                                                                                                                                                                
12:47:56 - cmdstanpy - INFO - CmdStan done processing.
12:47:56 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_soft.stan', line 45, column 2 to column 34)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_soft.stan', line 44, column 2 to column 34)
Consider re-running with show_console=True if the above output is unclear!

Runtime performance

In the small data regime, the soft sum-to-zero constraint takes considerably more wall-clock time to fit the data. On Apple M3 hardware, all three models fit the large dataset quickly.

Model Checking and Comparison

Check convergence

We check the R-hat and effective sample size (ESS) for all group-level parameters.

In order to do this multiway comparison, we assemble the individual summaries from the 6 runs above into two dataframes. We also compute the number of effective samples per second, a key measure of sampling efficiency. (Note: we’re using a development version of CmdStanPy to scrape timing information from the CSV files, because changes to CmdStan’s stansummary utility removed the ESS/sec metric; this is a workaround for now.)

Show Code
# helper: label the summary index and compute ESS/sec from total sampling time
def fit_summary_ess_per_sec(fit, label: str) -> pd.DataFrame:
    summary = fit.summary(sig_figs=2)
    summary.index = summary.index.astype(str) + "  " + label
    total_time = sum(chain_time['total'] for chain_time in fit.time)
    summary['ESS/sec'] = summary['ESS_bulk'] / total_time
    return summary

# small dataset
small_data_fits_summary = pd.concat([
    fit_summary_ess_per_sec(binomial_ozs_fit, "a) ozs"),
    fit_summary_ess_per_sec(binomial_hard_fit, "b) hard"),
    fit_summary_ess_per_sec(binomial_soft_fit, "c) soft"),
])

# large dataset
large_data_fits_summary = pd.concat([
    fit_summary_ess_per_sec(binomial_ozs_fit_lg, "a) ozs"),
    fit_summary_ess_per_sec(binomial_hard_fit_lg, "b) hard"),
    fit_summary_ess_per_sec(binomial_soft_fit_lg, "c) soft"),
])

Eth

Show Code
beta_eth_summary = summarize_predictor(small_data_fits_summary, r'beta_eth\[')
beta_eth_summary_lg = summarize_predictor(large_data_fits_summary, r'beta_eth\[')
Show Code
small_html = style_dataframe(beta_eth_summary, 3).to_html()
large_html = style_dataframe(beta_eth_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html)
Small Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
beta_eth[1] a) ozs 1.40 0.10 3700.00 1653.26 1.00
beta_eth[1] b) hard 1.40 0.10 3500.00 1143.79 1.00
beta_eth[1] c) soft 1.40 0.10 2400.00 51.29 1.00
beta_eth[2] a) ozs -1.50 0.08 4200.00 1876.68 1.00
beta_eth[2] b) hard -1.50 0.08 3400.00 1111.11 1.00
beta_eth[2] c) soft -1.50 0.08 2900.00 61.97 1.00
beta_eth[3] a) ozs 0.10 0.08 4500.00 2010.72 1.00
beta_eth[3] b) hard 0.10 0.08 7900.00 2581.70 1.00
beta_eth[3] c) soft 0.10 0.08 3100.00 66.25 1.00
Large Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
beta_eth[1] a) ozs 1.30 0.03 4500.00 1822.60 1.00
beta_eth[1] b) hard 1.30 0.03 3600.00 1130.30 1.00
beta_eth[1] c) soft 1.30 0.03 3700.00 212.29 1.00
beta_eth[2] a) ozs -1.50 0.02 4900.00 1984.61 1.00
beta_eth[2] b) hard -1.50 0.02 3700.00 1161.70 1.00
beta_eth[2] c) soft -1.50 0.02 4300.00 246.72 1.00
beta_eth[3] a) ozs 0.17 0.02 5400.00 2187.12 1.00
beta_eth[3] b) hard 0.17 0.02 8500.00 2668.76 1.00
beta_eth[3] c) soft 0.17 0.02 5200.00 298.35 1.00
Show Code
print("params", true_params['beta_eth'], "\npcts", true_params['pct_eth'])
params [ 1.97 -0.8   0.75] 
pcts [0.13 0.64 0.22]
Show Code
sigma_eth_summary = summarize_predictor(small_data_fits_summary, 'sigma_eth')
sigma_eth_summary_lg = summarize_predictor(large_data_fits_summary, 'sigma_eth')

small_html = style_dataframe(sigma_eth_summary, 3).to_html()
large_html = style_dataframe(sigma_eth_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html)
Small Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
sigma_eth a) ozs 1.10 0.39 5100.00 2278.82 1.00
sigma_eth b) hard 1.40 0.45 4800.00 1568.63 1.00
sigma_eth c) soft 1.20 0.41 4700.00 100.44 1.00
Large Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
sigma_eth a) ozs 1.00 0.37 4900.00 1984.61 1.00
sigma_eth b) hard 1.30 0.42 4600.00 1444.27 1.00
sigma_eth c) soft 1.20 0.40 6000.00 344.25 1.00

Edu

Show Code
beta_edu_summary = summarize_predictor(small_data_fits_summary, r'beta_edu\[')
beta_edu_summary_lg = summarize_predictor(large_data_fits_summary, r'beta_edu\[')
Show Code
small_html = style_dataframe(beta_edu_summary, 3).to_html()
large_html = style_dataframe(beta_edu_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html)
Small Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
beta_edu[1] a) ozs -0.19 0.12 4700.00 2100.09 1.00
beta_edu[1] b) hard -0.19 0.12 4600.00 1503.27 1.00
beta_edu[1] c) soft -0.18 0.12 3900.00 83.34 1.00
beta_edu[2] a) ozs -0.36 0.12 4700.00 2100.09 1.00
beta_edu[2] b) hard -0.37 0.12 4200.00 1372.55 1.00
beta_edu[2] c) soft -0.36 0.12 4200.00 89.76 1.00
beta_edu[3] a) ozs 0.63 0.10 4100.00 1831.99 1.00
beta_edu[3] b) hard 0.62 0.10 3700.00 1209.15 1.00
beta_edu[3] c) soft 0.63 0.10 3100.00 66.25 1.00
beta_edu[4] a) ozs -1.20 0.22 3100.00 1385.17 1.00
beta_edu[4] b) hard -1.20 0.21 2700.00 882.35 1.00
beta_edu[4] c) soft -1.20 0.21 2200.00 47.01 1.00
beta_edu[5] a) ozs 1.10 0.12 3800.00 1697.94 1.00
beta_edu[5] b) hard 1.10 0.11 3800.00 1241.83 1.00
beta_edu[5] c) soft 1.10 0.11 3100.00 66.25 1.00
Large Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
beta_edu[1] a) ozs -0.05 0.04 4700.00 1903.60 1.00
beta_edu[1] b) hard -0.05 0.04 3800.00 1193.09 1.00
beta_edu[1] c) soft -0.05 0.04 6000.00 344.25 1.00
beta_edu[2] a) ozs -0.53 0.04 4500.00 1822.60 1.00
beta_edu[2] b) hard -0.53 0.04 4600.00 1444.27 1.00
beta_edu[2] c) soft -0.53 0.04 5400.00 309.83 1.00
beta_edu[3] a) ozs 0.60 0.03 5400.00 2187.12 1.00
beta_edu[3] b) hard 0.60 0.03 4000.00 1255.89 1.00
beta_edu[3] c) soft 0.60 0.03 4600.00 263.93 1.00
beta_edu[4] a) ozs -1.20 0.06 4600.00 1863.10 1.00
beta_edu[4] b) hard -1.20 0.06 2900.00 910.52 1.00
beta_edu[4] c) soft -1.20 0.06 4000.00 229.50 1.00
beta_edu[5] a) ozs 1.20 0.03 4500.00 1822.60 1.00
beta_edu[5] b) hard 1.20 0.03 3900.00 1224.49 1.00
beta_edu[5] c) soft 1.20 0.03 5500.00 315.57 1.00
Show Code
print("params", true_params['beta_edu'], "\npcts", true_params['pct_edu'])
params [ 0.24 -0.22  0.87 -0.85  1.5 ] 
pcts [0.19 0.24 0.27 0.09 0.2 ]
Show Code
sigma_edu_summary = summarize_predictor(small_data_fits_summary, 'sigma_edu')
sigma_edu_summary_lg = summarize_predictor(large_data_fits_summary, 'sigma_edu')

small_html = style_dataframe(sigma_edu_summary, 3).to_html()
large_html = style_dataframe(sigma_edu_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html)
Small Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
sigma_edu a) ozs 0.84 0.30 3600.00 1608.58 1.00
sigma_edu b) hard 0.85 0.33 4100.00 1339.87 1.00
sigma_edu c) soft 0.91 0.30 3900.00 83.34 1.00
Large Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
sigma_edu a) ozs 0.88 0.29 5200.00 2106.12 1.00
sigma_edu b) hard 0.87 0.31 5400.00 1695.45 1.00
sigma_edu c) soft 0.96 0.31 6100.00 349.99 1.00

Age

Show Code
beta_age_summary = summarize_predictor(small_data_fits_summary, r'beta_age\[')
beta_age_summary_lg = summarize_predictor(large_data_fits_summary, r'beta_age\[')
Show Code
small_html = style_dataframe(beta_age_summary, 3).to_html()
large_html = style_dataframe(beta_age_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html)
Small Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
beta_age[1] a) ozs -0.78 0.20 6400.00 2859.70 1.00
beta_age[1] b) hard -0.78 0.20 3700.00 1209.15 1.00
beta_age[1] c) soft -0.78 0.21 3600.00 76.93 1.00
beta_age[2] a) ozs -1.10 0.53 1900.00 848.97 1.00
beta_age[2] b) hard -1.20 0.54 1900.00 620.92 1.00
beta_age[2] c) soft -1.20 0.54 1600.00 34.19 1.00
beta_age[3] a) ozs -1.30 0.23 5500.00 2457.55 1.00
beta_age[3] b) hard -1.30 0.25 4200.00 1372.55 1.00
beta_age[3] c) soft -1.30 0.24 3800.00 81.21 1.00
beta_age[4] a) ozs 1.20 0.15 3500.00 1563.90 1.00
beta_age[4] b) hard 1.20 0.14 2800.00 915.03 1.00
beta_age[4] c) soft 1.20 0.15 2400.00 51.29 1.00
beta_age[5] a) ozs 0.04 0.28 4700.00 2100.09 1.00
beta_age[5] b) hard 0.03 0.28 4400.00 1437.91 1.00
beta_age[5] c) soft 0.04 0.28 3700.00 79.07 1.00
beta_age[6] a) ozs -0.79 0.19 4200.00 1876.68 1.00
beta_age[6] b) hard -0.79 0.19 3900.00 1274.51 1.00
beta_age[6] c) soft -0.79 0.18 3800.00 81.21 1.00
beta_age[7] a) ozs 0.93 0.13 2800.00 1251.12 1.00
beta_age[7] b) hard 0.94 0.13 2500.00 816.99 1.00
beta_age[7] c) soft 0.94 0.13 1900.00 40.60 1.00
beta_age[8] a) ozs 1.10 0.16 3600.00 1608.58 1.00
beta_age[8] b) hard 1.20 0.15 3000.00 980.39 1.00
beta_age[8] c) soft 1.20 0.16 2600.00 55.56 1.00
beta_age[9] a) ozs 0.68 0.18 3600.00 1608.58 1.00
beta_age[9] b) hard 0.71 0.18 3200.00 1045.75 1.00
beta_age[9] c) soft 0.69 0.19 3400.00 72.66 1.00
Large Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
beta_age[1] a) ozs -0.59 0.06 5700.00 2308.63 1.00
beta_age[1] b) hard -0.59 0.06 5100.00 1601.26 1.00
beta_age[1] c) soft -0.59 0.06 5300.00 304.09 1.00
beta_age[2] a) ozs -1.30 0.16 2100.00 850.55 1.00
beta_age[2] b) hard -1.30 0.16 2300.00 722.14 1.00
beta_age[2] c) soft -1.30 0.16 3000.00 172.13 1.00
beta_age[3] a) ozs -1.50 0.08 5900.00 2389.63 1.00
beta_age[3] b) hard -1.50 0.08 5500.00 1726.84 1.00
beta_age[3] c) soft -1.50 0.08 5400.00 309.83 1.00
beta_age[4] a) ozs 1.20 0.04 3700.00 1498.58 1.00
beta_age[4] b) hard 1.20 0.04 3300.00 1036.11 1.00
beta_age[4] c) soft 1.20 0.04 3400.00 195.08 1.00
beta_age[5] a) ozs -0.28 0.09 4800.00 1944.11 1.00
beta_age[5] b) hard -0.28 0.09 5000.00 1569.86 1.00
beta_age[5] c) soft -0.28 0.08 5500.00 315.57 1.00
beta_age[6] a) ozs -0.62 0.05 4400.00 1782.10 1.00
beta_age[6] b) hard -0.62 0.05 4300.00 1350.08 1.00
beta_age[6] c) soft -0.62 0.05 4400.00 252.45 1.00
beta_age[7] a) ozs 1.10 0.04 2800.00 1134.06 1.00
beta_age[7] b) hard 1.10 0.04 2900.00 910.52 1.00
beta_age[7] c) soft 1.10 0.04 3100.00 177.86 1.00
beta_age[8] a) ozs 1.50 0.04 3400.00 1377.08 1.00
beta_age[8] b) hard 1.50 0.04 3300.00 1036.11 1.00
beta_age[8] c) soft 1.50 0.04 4100.00 235.24 1.00
beta_age[9] a) ozs 0.45 0.06 3700.00 1498.58 1.00
beta_age[9] b) hard 0.45 0.06 3500.00 1098.90 1.00
beta_age[9] c) soft 0.45 0.05 4800.00 275.40 1.00
Show Code
print("params", true_params['beta_age'], "\npcts", true_params['pct_age'])
params [-0.74 -1.43 -1.67  1.   -0.52 -0.77  0.96  1.31  0.25] 
pcts [0.11 0.02 0.1  0.15 0.04 0.14 0.25 0.12 0.08]
Show Code
sigma_age_summary = summarize_predictor(small_data_fits_summary, 'sigma_age')
sigma_age_summary_lg = summarize_predictor(large_data_fits_summary, 'sigma_age')

small_html = style_dataframe(sigma_age_summary, 3).to_html()
large_html = style_dataframe(sigma_age_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html)
Small Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
sigma_age a) ozs 0.99 0.26 3700.00 1653.26 1.00
sigma_age b) hard 1.10 0.28 3700.00 1209.15 1.00
sigma_age c) soft 1.00 0.26 2900.00 61.97 1.00
Large Dataset
  Mean StdDev ESS_bulk ESS/sec R_hat
sigma_age a) ozs 1.00 0.25 5200.00 2106.12 1.00
sigma_age b) hard 1.20 0.28 5100.00 1601.26 1.00
sigma_age c) soft 1.10 0.27 5800.00 332.78 1.00

All models have R-hat values of 1.00 for all group-level parameters and high effective sample sizes.

Comparison with the true parameters shows that the model recovers the sign of the parameter, but not the exact value. With more data and only a few categories, the model does a better job of recovering the true parameters.

In almost all cases, the estimates for each parameter agree across implementations to 2 significant figures. The few that differ do so by 0.01, and only for categories with correspondingly low percentages of observations. This is as expected: all three implementations impose the same sum-to-zero constraint, and the sum_to_zero_vector implementation is both fast and efficient.

Calibration check

All models contain a generated quantities block which creates y_rep, the posterior predictive sample. If the model is well calibrated for the data, we expect the observed value of y to fall within the central 50% interval of the y_rep estimates at least 50% of the time.
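The check reported below uses the helper ppc_central_interval from utils_dataviz; its core computation can be sketched as follows (a minimal stand-in with a hypothetical name, not the actual helper):

```python
import numpy as np

def central_50_coverage(y_rep: np.ndarray, y: np.ndarray) -> float:
    """Fraction of observations whose value falls inside the central 50%
    interval of their posterior predictive draws.
    y_rep: (draws x observations) array; y: observed values."""
    lower = np.percentile(y_rep, 25, axis=0)   # 25th percentile per observation
    upper = np.percentile(y_rep, 75, axis=0)   # 75th percentile per observation
    return float(((y >= lower) & (y <= upper)).mean())
```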

Show Code
from utils_dataviz import ppc_central_interval

y_rep_ozs = binomial_ozs_fit.y_rep.astype(int)
print("sum_to_zero_vector fit", ppc_central_interval(y_rep_ozs, sim_data.pos_tests[0]))

y_rep_hard = binomial_hard_fit.y_rep.astype(int)
print("Hard sum-to-zero fit", ppc_central_interval(y_rep_hard, sim_data.pos_tests[0]))

y_rep_soft = binomial_soft_fit.y_rep.astype(int)
print("Soft sum-to-zero fit", ppc_central_interval(y_rep_soft, sim_data.pos_tests[0]))
sum_to_zero_vector fit y total: 270, ct y is within y_rep central 50% interval: 225, pct: 83.33
Hard sum-to-zero fit y total: 270, ct y is within y_rep central 50% interval: 224, pct: 82.96
Soft sum-to-zero fit y total: 270, ct y is within y_rep central 50% interval: 222, pct: 82.22

Prior predictive checks

Prior and posterior predictive checks are two cases of the general concept of predictive checks, just conditioning on different things (no data and the observed data, respectively). In the previous section, we compared the y_rep, the replicated dataset, with the observed dataset y.

Prior predictive checks simulate data directly from the prior, in the absence of any observed data. The resulting datasets, often called y_sim instead of y_rep, show the range of data consistent with the priors. Here we use the simulated dataset to examine the prior marginal variances of the elements of the sum-to-zero vector under the hard sum-to-zero constraint and the built-in sum_to_zero transform.

Just as we wrote a Stan program for the true data-generating model, which produced the observed data y, we can write a Stan program that omits the likelihood statement from the model block, along with any corresponding computations in the transformed parameters and generated quantities blocks.

To do this, we delete the likelihood statement and any statements that generate the auxiliary variables it needs. The parameters block is unchanged.

binomial_4preds_ozs_ppc.stan

// generate a sample from the model priors (before seeing any data)
data {
  int<lower=1> N; // number of strata
  int<lower=1> N_age;
  int<lower=1> N_eth;
  int<lower=1> N_edu;
  // omit observational data
}
transformed data {
  // scaling factors for marginal variances of sum_to_zero_vectors
  // https://discourse.mc-stan.org/t/zero-sum-vector-and-normal-distribution/38296
  real s_age = sqrt(N_age * inv(N_age - 1));
  real s_eth = sqrt(N_eth * inv(N_eth - 1));
  real s_edu = sqrt(N_edu * inv(N_edu - 1));
}
parameters {
  real beta_0;
  real beta_sex;
  real<lower=0> sigma_age, sigma_eth, sigma_edu;
  sum_to_zero_vector[N_age] beta_age;
  sum_to_zero_vector[N_eth] beta_eth;
  sum_to_zero_vector[N_edu] beta_edu;
}
model {
  // omit likelihood
  // priors
  beta_0 ~ normal(0, 2.5);
  beta_sex ~ std_normal();
  sigma_eth ~ std_normal();
  sigma_age ~ std_normal();
  sigma_edu ~ std_normal();

  // centered parameterization
  // scale normal priors on sum_to_zero_vectors
  beta_age ~ normal(0, s_age * sigma_age);
  beta_eth ~ normal(0, s_eth * sigma_eth);
  beta_edu ~ normal(0, s_edu * sigma_edu);
}

Running this model will produce a sample of draws according to the prior distribution; from this we can infer the range of possible parameter values which are consistent with these priors.
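The role of the scaling factors in the transformed data block can be checked numerically: if z has i.i.d. normal(0, s·σ) entries with s = sqrt(N/(N−1)), then imposing the sum-to-zero constraint by centering leaves marginal variances of σ², since Var(z_i − z̄) = s²σ²(1 − 1/N) = σ². A quick simulation, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 9
s = np.sqrt(N / (N - 1))                       # scaling factor, as in transformed data
z = rng.normal(0.0, s, size=(200_000, N))      # i.i.d. draws at the scaled sd
centered = z - z.mean(axis=1, keepdims=True)   # impose sum-to-zero by centering
print(np.var(centered, axis=0))                # each marginal variance approximately 1.0
```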

Show Code
binomial_ozs_ppc_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_ozs_ppc.stan'))
binomial_ozs_ppc_fit = binomial_ozs_ppc_mod.sample(data=data_small, parallel_chains=4, seed=a_seed)
12:48:09 - cmdstanpy - INFO - CmdStan start processing
12:48:11 - cmdstanpy - INFO - CmdStan done processing.
12:48:11 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs_ppc.stan', line 35, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs_ppc.stan', line 36, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs_ppc.stan', line 35, column 2 to column 42)
Consider re-running with show_console=True if the above output is unclear!
12:48:11 - cmdstanpy - WARNING - Some chains may have failed to converge.
    Chain 1 had 114 divergent transitions (11.4%)
    Chain 1 had 60 iterations at max treedepth (6.0%)
    Chain 2 had 79 divergent transitions (7.9%)
    Chain 2 had 565 iterations at max treedepth (56.5%)
    Chain 3 had 33 divergent transitions (3.3%)
    Chain 3 had 122 iterations at max treedepth (12.2%)
    Chain 4 had 122 divergent transitions (12.2%)
    Chain 4 had 499 iterations at max treedepth (49.9%)
    Use the "diagnose()" method on the CmdStanMCMC object to see further information.
Show Code
binomial_hard_ppc_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_hard_ppc.stan'))
binomial_hard_ppc_fit = binomial_hard_ppc_mod.sample(data=data_small, parallel_chains=4, seed=a_seed)
12:48:11 - cmdstanpy - INFO - CmdStan start processing
12:48:11 - cmdstanpy - INFO - CmdStan done processing.
12:48:11 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard_ppc.stan', line 32, column 2 to column 38)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard_ppc.stan', line 33, column 2 to column 38)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard_ppc.stan', line 32, column 2 to column 38)
Consider re-running with show_console=True if the above output is unclear!
12:48:11 - cmdstanpy - WARNING - Some chains may have failed to converge.
    Chain 1 had 6 divergent transitions (0.6%)
    Chain 2 had 5 divergent transitions (0.5%)
    Chain 3 had 17 divergent transitions (1.7%)
    Chain 4 had 14 divergent transitions (1.4%)
    Use the "diagnose()" method on the CmdStanMCMC object to see further information.

Without any data, the sampler has many divergent transitions because these priors put positive probability on regions of the parameter space with high curvature and/or low numerical accuracy; conditional on the data, however, those regions have zero probability, cf. this discussion.

Here we are interested in the marginal variances of the elements of the sum-to-zero effect, in particular the behavior of the constrained \(N^{th}\) element, so we ignore these warnings, knowing that they go away once the model is conditioned on data.

Marginal variances of the built-in sum_to_zero_vector

Show Code
age_ozs = binomial_ozs_ppc_fit.beta_age
np.var(age_ozs, axis=0)
array([0.16, 0.13, 0.09, 0.08, 0.08, 0.08, 0.08, 0.06, 0.14])

Marginal variances of the hard sum-to-zero constraint

Show Code
age_hard = binomial_hard_ppc_fit.beta_age
np.var(age_hard, axis=0)
array([1.06, 1.12, 1.3 , 1.2 , 1.02, 1.1 , 1.13, 1.03, 8.57])

By simulating data from the priors, we can see how the hard sum-to-zero constraint distorts the variance of the \(N^{th}\) element. This is only a problem for very sparse datasets, where the prior swamps the data. To see this, we fit both models to the tiny dataset.
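Up to the hierarchical scale sigma_age, the inflated variance of the last element follows directly from the construction: with unit-normal priors on the N−1 free elements, the \(N^{th}\) element is their negated sum and so has prior variance N−1 (8 for the nine age groups, consistent with the 8.57 above). A quick check, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 9
# hard constraint: N-1 free elements, last element is their negated sum
phi_raw = rng.normal(0.0, 1.0, size=(200_000, N - 1))
phi_last = -phi_raw.sum(axis=1)
print(np.var(phi_raw[:, 0]), np.var(phi_last))  # approximately 1 and N - 1 = 8
```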

Show Code
binomial_ozs_fit_tiny = binomial_ozs_mod.sample(data=data_tiny, parallel_chains=4, seed=a_seed)
binomial_hard_fit_tiny = binomial_hard_mod.sample(data=data_tiny, parallel_chains=4, seed=a_seed)
12:48:11 - cmdstanpy - INFO - CmdStan start processing
12:48:12 - cmdstanpy - INFO - CmdStan done processing.
12:48:12 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 56, column 2 to column 42)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_ozs.stan', line 55, column 2 to column 42)
Consider re-running with show_console=True if the above output is unclear!
12:48:12 - cmdstanpy - INFO - CmdStan start processing
12:48:13 - cmdstanpy - INFO - CmdStan done processing.
12:48:13 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 45, column 2 to column 38)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 44, column 2 to column 38)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'binomial_4preds_hard.stan', line 44, column 2 to column 38)
Consider re-running with show_console=True if the above output is unclear!
Show Code
age_ozs_tiny = binomial_ozs_fit_tiny.beta_age
marginal_vars_ozs = np.var(age_ozs_tiny, axis=0)
age_hard_tiny = binomial_hard_fit_tiny.beta_age
marginal_vars_hard = np.var(age_hard_tiny, axis=0)
print("Tiny dataset, marginal variances beta age - sum_to_zero_vector\n", marginal_vars_ozs)
print("\n\nTiny dataset, marginal variances beta age - hard sum-to-zero constraint\n", marginal_vars_hard)
Tiny dataset, marginal variances beta age - sum_to_zero_vector
 [0.16 0.13 0.09 0.08 0.08 0.08 0.08 0.06 0.14]


Tiny dataset, marginal variances beta age - hard sum-to-zero constraint
 [1.06 1.12 1.3  1.2  1.02 1.1  1.13 1.03 8.57]

With more data, this problem goes away. To see this, we compare the marginal variances from the small dataset fits.

Show Code
age_ozs = binomial_ozs_fit.beta_age
marginal_vars_ozs = np.var(age_ozs, axis=0)
age_hard = binomial_hard_fit.beta_age
marginal_vars_hard = np.var(age_hard, axis=0)
print("Small dataset, marginal variances beta age - sum_to_zero_vector\n", marginal_vars_ozs)
print("\n\nSmall dataset, marginal variances beta age - hard sum-to-zero constraint\n", marginal_vars_hard)
Small dataset, marginal variances beta age - sum_to_zero_vector
 [0.04 0.28 0.05 0.02 0.08 0.04 0.02 0.03 0.03]


Small dataset, marginal variances beta age - hard sum-to-zero constraint
 [0.04 0.29 0.06 0.02 0.08 0.04 0.02 0.02 0.03]

Discussion

For the multi-level model, the sum_to_zero_vector provides fast results and good effective sample sizes for both datasets. Model stan/binomial_4preds_ozs.stan shows how to properly scale the variance of a sum_to_zero_vector constrained parameter in order to put a standard normal prior on it.

Workflow Practices

  • Prior predictive checks demonstrate the difference between the marginal variances of the sum_to_zero_vector and hard sum-to-zero implementations.

  • Posterior predictive checks demonstrate that the model is well calibrated to the data.

Spatial Models with an ICAR component

Spatial auto-correlation is the tendency for adjacent areas to share similar characteristics. Conditional Auto-Regressive (CAR) and Intrinsic Conditional Auto-Regressive (ICAR) models, first introduced by Besag, 1974, account for this by pooling information from neighboring regions. The BYM model (Besag, York, Mollié, 1991) combines a lognormal Poisson model and an ICAR component for spatial auto-correlation with an ordinary random-effects component for non-spatial heterogeneity. The BYM2 model builds on this model and subsequent refinements.

The ICAR, BYM2, and BYM2_multicomp models are explained more fully in a series of notebooks available from the GitHub repo https://github.com/mitzimorris/geomed_2024.

Example dataset: New York City traffic accidents

The dataset is the one used in the analysis published in the 2019 paper Bayesian Hierarchical Spatial Models: Implementing the Besag York Mollié Model in Stan.

The data consist of motor vehicle collisions in New York City, as recorded by the NYC Department of Transportation between 2005 and 2014, restricted to collisions involving school-age children (5-18 years of age) as pedestrians. Each crash was localized to the US Census tract in which it occurred, using the 2010 Census block map for New York City. File data/nyc_study.geojson contains the study data, census tract ids, and geometry.

Show Code
nyc_geodata = gpd.read_file(os.path.join('data', 'nyc_study.geojson'))
nyc_geodata.columns
nyc_geodata[['BoroName', 'NTAName', 'count', 'kid_pop']].head(4)
BoroName NTAName count kid_pop
0 Bronx Soundview-Castle Hill-Clason Point-Harding Park 9.00 894.00
1 Bronx Soundview-Castle Hill-Clason Point-Harding Park 12.00 1158.00
2 Bronx Soundview-Castle Hill-Clason Point-Harding Park 28.00 1120.00
3 Bronx Mott Haven-Port Morris 14.00 281.00

The shapefiles from the Census Bureau connect Manhattan to Brooklyn and Queens, but for this analysis Manhattan should be treated as separate from Brooklyn and Queens. Assembling the data in the order required for our analysis takes some data munging, encapsulated in the Python functions in file utils_nyc_map.py. The function nyc_sort_by_comp_size removes any neighbor pairs between tracts in Manhattan and tracts in Brooklyn or Queens and updates the neighbor graph accordingly. It returns the cleaned neighbor graph and the corresponding geodataframe, plus a list of the component sizes, sorted so that the largest component (Brooklyn and Queens) is first and singleton nodes are last.

Show Code
from utils_nyc_map import nyc_sort_by_comp_size

(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)
nyc_comp_sizes
[1360, 329, 271, 108, 22, 2, 1, 1, 1]

To check our work we examine both the geodataframe and the map.

Show Code
nyc_gdf[['BoroName', 'NTAName', 'count', 'kid_pop']].head(4)
BoroName NTAName count kid_pop
0 Brooklyn East New York 6.00 432.00
1 Brooklyn Flatbush 18.00 618.00
2 Brooklyn Erasmus 15.00 891.00
3 Brooklyn Flatbush 16.00 655.00
Show Code
nyc_gdf[['BoroName', 'NTAName', 'count', 'kid_pop']].tail(4)
BoroName NTAName count kid_pop
2091 Manhattan Lenox Hill-Roosevelt Island 1.00 1134.00
2092 Queens Breezy Point-Belle Harbor-Rockaway Park-Broad ... 3.00 402.00
2093 Queens Breezy Point-Belle Harbor-Rockaway Park-Broad ... 1.00 609.00
2094 Bronx Pelham Bay-Country Club-City Island 6.00 565.00
Show Code
from splot.libpysal import plot_spatial_weights 
plot_spatial_weights(nyc_nbs, nyc_gdf)

Model 1: The BYM2 model, Riebler et al. 2016

The key element of the BYM2 model is the ICAR component: a multivariate normal random vector \(\mathbf{\phi}\) whose conditional specification gives each \({\phi}_i\) a distribution conditional on the values of its neighbors.

The joint specification rewrites to a pairwise difference form:

\[ p(\phi) \propto \exp \left\{ {- \frac{1}{2}} \sum_{i \sim j}{({\phi}_i - {\phi}_j)}^2 \right\} \]

Each \({({\phi}_i - {\phi}_j)}^2\) contributes a penalty term based on the distance between the values of neighboring regions. However, \(\phi\) is non-identifiable: a constant added to \(\phi\) washes out of the differences \({\phi}_i - {\phi}_j\). Therefore, a sum-to-zero constraint is needed both to identify and to center \(\phi\).

The Stan implementation of the ICAR component computes the sum of the pairwise differences by representing the spatial adjacency matrix as an array of pairs of neighbor indices.

data {
  ...
  // spatial structure
  int<lower = 0> N_edges;  // number of neighbor pairs
  array[2, N_edges] int<lower = 1, upper = N> neighbors;  // columnwise adjacent

The ICAR prior comes into the model as parameter phi.

model {
  ...
  target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]);  // ICAR prior

In this section, we compare three ways of implementing the sum-to-zero constraint on phi.

  • In model bym2_ozs.stan, phi is declared as a sum_to_zero_vector.

  • In model bym2_ozs_hard.stan, phi_raw is the unconstrained parameter of size N - 1, and the N-length parameter phi is computed in the transformed parameters block.

  • In model bym2_soft.stan, phi is declared as an ordinary vector, and the sum-to-zero constraint is combined with the prior:

  target += (-0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]])
         + normal_lupdf(sum(phi) | 0, 0.001 * rows(phi)));
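The soft-constraint statement can be mirrored in numpy to make the edge indexing concrete. This is an illustrative re-implementation (constants dropped, as with normal_lupdf, and 0-based indexing), not code from the models:

```python
import numpy as np

def icar_lpdf_soft(phi: np.ndarray, neighbors: np.ndarray, scale: float = 0.001) -> float:
    """Unnormalized ICAR log density with the soft sum-to-zero constraint.
    neighbors: 2 x N_edges array; column j holds the endpoints of edge j."""
    diffs = phi[neighbors[0]] - phi[neighbors[1]]   # pairwise differences over edges
    pairwise = -0.5 * np.dot(diffs, diffs)          # -0.5 * dot_self(...)
    sd = scale * phi.size                           # 0.001 * rows(phi)
    soft = -0.5 * (phi.sum() / sd) ** 2             # normal log-kernel on sum(phi)
    return pairwise + soft
```

The soft term is a steep quadratic penalty on sum(phi), so draws concentrate near (but not exactly on) the sum-to-zero subspace.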

The ICAR model requires that the neighbor graph is fully connected, for two reasons:

  • The joint distribution is computed from the pairwise differences between a node and its neighbors; singleton nodes have no neighbors, so their contribution is undefined.

  • Even when the graph has no singleton nodes, if it has multiple connected components, a sum-to-zero constraint on the entire vector fails to properly identify the model.

Because the BYM2 model includes an ICAR component, it too requires a fully connected neighbor graph. We can either artificially connect the map or analyze the NYC dataset one component at a time, starting with the largest component, which encompasses Brooklyn and Queens (excepting the Rockaways).

Show Code
from libpysal.weights import Queen
brklyn_qns_gdf = nyc_gdf[nyc_gdf['comp_id']==0].reset_index(drop=True)
brklyn_qns_nbs = Queen.from_dataframe(brklyn_qns_gdf, geom_col='geometry')
plot_spatial_weights(brklyn_qns_nbs, brklyn_qns_gdf)

print(f'number of components: {brklyn_qns_nbs.n_components}')
print(f'islands? {brklyn_qns_nbs.islands}')
print(f'max number of neighbors per node: {brklyn_qns_nbs.max_neighbors}')
print(f'mean number of neighbors per node: {brklyn_qns_nbs.mean_neighbors}')
number of components: 1
islands? []
max number of neighbors per node: 14
mean number of neighbors per node: 5.977941176470588

Data assembly

The inputs to the BYM2 model are

  • The Poisson regression data

    • int<lower=0> N - number of regions
    • array[N] int<lower=0> y - per-region count outcome
    • vector<lower=0>[N] E - the population of each region (a.k.a. “exposure”),
    • int<lower=1> K - the number of predictors
    • matrix[N, K] xs - the design matrix
  • The spatial structure

    • int<lower = 0> N_edges - the number of neighbor pairs
    • array[2, N_edges] int<lower = 1, upper = N> neighbors - the graph structure
    • real tau - the scaling factor introduced in the BYM2 paper

The scaling factor tau was introduced by Riebler et al. so that the variances of the spatial and ordinary random effects are both approximately equal to 1, allowing a straightforward estimate of the relative amounts of spatial and non-spatial variance. The helper function get_scaling_factor in file utils_bym2.py takes the neighbor graph as its argument and computes the geometric mean of the marginal variances of the ICAR effect implied by the corresponding adjacency matrix.
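The computation behind such a scaling factor can be sketched as follows (an illustrative re-implementation, not the actual utils_bym2.py code): build the ICAR precision matrix (the graph Laplacian), compute its generalized inverse under the sum-to-zero constraint, and take the geometric mean of the resulting marginal variances.

```python
import numpy as np

def scaling_factor_sketch(A: np.ndarray) -> float:
    """A: symmetric 0/1 adjacency matrix of a single connected graph."""
    N = A.shape[0]
    Q = np.diag(A.sum(axis=1)) - A   # ICAR precision (graph Laplacian), rank N-1
    # add a small jitter so Q is invertible, then condition on sum(phi) = 0
    Q_jit = Q + np.eye(N) * Q.diagonal().max() * np.sqrt(np.finfo(float).eps)
    Sigma = np.linalg.inv(Q_jit)
    ones = np.ones((N, 1))
    Sigma = Sigma - (Sigma @ ones) @ (ones.T @ Sigma) / float(ones.T @ Sigma @ ones)
    # geometric mean of the marginal variances of the ICAR effect
    return float(np.exp(np.log(Sigma.diagonal()).mean()))
```

For a 4-cycle graph this yields 0.3125, the geometric mean of the (equal) marginal variances given by the pseudo-inverse of the cycle's Laplacian.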

Show Code
from utils_bym2 import get_scaling_factor, nbs_to_adjlist

# design matrix
design_vars = np.array(['pct_pubtransit','med_hh_inc', 'traffic', 'frag_index'])
design_mat = brklyn_qns_gdf[design_vars].to_numpy()
design_mat[:, 1] = np.log(design_mat[:, 1])
design_mat[:, 2] = np.log(design_mat[:, 2])

# neighbors array
brklyn_qns_nbs_adj = nbs_to_adjlist(brklyn_qns_nbs)

# scaling factor
tau = get_scaling_factor(brklyn_qns_nbs)

brklyn_qns_data = {"N":brklyn_qns_gdf.shape[0],
            "y":brklyn_qns_gdf['count'].astype('int'),
            "E":brklyn_qns_gdf['kid_pop'].astype('int'),
            "K":4,
            "xs":design_mat,
            "N_edges": brklyn_qns_nbs_adj.shape[1],
            "neighbors": brklyn_qns_nbs_adj,
            "tau": tau
}

Model fitting

These models require a larger number of warmup iterations to reach convergence for all parameters, including the hyperparameters rho and sigma.

Show Code
bym2_ozs_mod = CmdStanModel(stan_file=os.path.join('stan', 'bym2_ozs.stan'))
brklyn_qns_ozs_fit = bym2_ozs_mod.sample(data=brklyn_qns_data, iter_warmup=5000)
12:48:16 - cmdstanpy - INFO - CmdStan start processing
12:48:54 - cmdstanpy - INFO - CmdStan done processing.
Show Code
a_seed = brklyn_qns_ozs_fit.metadata.cmdstan_config['seed']
Show Code
bym2_soft_mod = CmdStanModel(stan_file=os.path.join('stan', 'bym2_soft.stan'))
brklyn_qns_soft_fit = bym2_soft_mod.sample(data=brklyn_qns_data, iter_warmup=5000, seed=a_seed)
12:48:55 - cmdstanpy - INFO - CmdStan start processing
12:49:45 - cmdstanpy - INFO - CmdStan done processing.
Show Code
bym2_hard_mod = CmdStanModel(stan_file=os.path.join('stan', 'bym2_hard.stan'))
brklyn_qns_hard_fit = bym2_hard_mod.sample(data=brklyn_qns_data, iter_warmup=5000, seed=a_seed)
12:49:46 - cmdstanpy - INFO - CmdStan start processing
12:51:52 - cmdstanpy - INFO - CmdStan done processing.

Model Comparison

Get summaries and compare fits.

Show Code
brklyn_qns_ozs_summary = brklyn_qns_ozs_fit.summary()
brklyn_qns_ozs_summary.index =  brklyn_qns_ozs_summary.index.astype(str) + "  a) ozs"
ozs_fit_time = brklyn_qns_ozs_fit.time
ozs_total_time = sum(chain['total'] for chain in ozs_fit_time)
brklyn_qns_ozs_summary['ESS/sec'] = brklyn_qns_ozs_summary['ESS_bulk']/ozs_total_time


brklyn_qns_hard_summary = brklyn_qns_hard_fit.summary()
brklyn_qns_hard_summary.index = brklyn_qns_hard_summary.index.astype(str) + "  b) hard"
hard_fit_time = brklyn_qns_hard_fit.time
hard_total_time = sum(t['total'] for t in hard_fit_time)
brklyn_qns_hard_summary['ESS/sec'] = brklyn_qns_hard_summary['ESS_bulk']/hard_total_time

brklyn_qns_soft_summary = brklyn_qns_soft_fit.summary()
brklyn_qns_soft_summary.index = brklyn_qns_soft_summary.index.astype(str) + "  c) soft"
soft_fit_time = brklyn_qns_soft_fit.time
soft_total_time = sum(t['total'] for t in soft_fit_time)
brklyn_qns_soft_summary['ESS/sec'] = brklyn_qns_soft_summary['ESS_bulk']/soft_total_time

brklyn_qns_fits_summary = pd.concat([brklyn_qns_ozs_summary, brklyn_qns_hard_summary, brklyn_qns_soft_summary])
Show Code
beta_summary = summarize_predictor(brklyn_qns_fits_summary, 'beta')
sigma_summary = summarize_predictor(brklyn_qns_fits_summary, 'sigma')
rho_summary = summarize_predictor(brklyn_qns_fits_summary, 'rho')

brklyn_qns_summary = pd.concat([beta_summary, sigma_summary, rho_summary])
Show Code
display(HTML(style_dataframe(brklyn_qns_summary, 3).to_html()))
  Mean StdDev ESS_bulk ESS/sec R_hat
beta0 a) ozs -4.45 0.02 1700.71 11.35 1.00
beta0 b) hard -4.45 0.02 1698.68 3.46 1.00
beta0 c) soft -4.45 0.02 2562.40 12.96 1.00
beta_intercept a) ozs -7.98 0.95 598.50 3.99 1.00
beta_intercept b) hard -7.93 0.94 521.65 1.06 1.00
beta_intercept c) soft -8.00 0.97 718.60 3.63 1.01
betas[1] a) ozs 1.25 0.24 301.24 2.01 1.01
betas[1] b) hard 1.21 0.25 306.50 0.62 1.02
betas[1] c) soft 1.22 0.25 448.98 2.27 1.01
betas[2] a) ozs 0.16 0.08 609.48 4.07 1.00
betas[2] b) hard 0.16 0.08 517.09 1.05 1.00
betas[2] c) soft 0.17 0.08 766.05 3.87 1.01
betas[3] a) ozs 0.09 0.03 1285.79 8.58 1.01
betas[3] b) hard 0.09 0.03 1071.16 2.18 1.00
betas[3] c) soft 0.09 0.03 2417.04 12.22 1.00
betas[4] a) ozs 0.04 0.02 898.50 5.99 1.00
betas[4] b) hard 0.04 0.02 673.46 1.37 1.01
betas[4] c) soft 0.04 0.02 1009.15 5.10 1.00
sigma a) ozs 0.75 0.03 160.37 1.07 1.04
sigma b) hard 0.75 0.03 188.83 0.38 1.01
sigma c) soft 0.75 0.03 81.05 0.41 1.04
rho a) ozs 0.38 0.07 92.61 0.62 1.06
rho b) hard 0.39 0.07 127.98 0.26 1.03
rho c) soft 0.40 0.08 80.53 0.41 1.06

We can repeat this procedure with the next-largest component, the Bronx (excepting City Island), which has 329 regions, roughly a quarter the size of the Brooklyn-Queens component's 1360 regions.

Show Code
bronx_gdf = nyc_gdf[nyc_gdf['comp_id']==1].reset_index(drop=True)
print(f'number of regions: {bronx_gdf.shape[0]}')
bronx_nbs = Queen.from_dataframe(bronx_gdf, geom_col='geometry')
plot_spatial_weights(bronx_nbs, bronx_gdf)

print(f'number of components: {bronx_nbs.n_components}')
print(f'islands? {bronx_nbs.islands}')
print(f'max number of neighbors per node: {bronx_nbs.max_neighbors}')
print(f'mean number of neighbors per node: {bronx_nbs.mean_neighbors}')
number of regions: 329
number of components: 1
islands? []
max number of neighbors per node: 12
mean number of neighbors per node: 5.860182370820668

Show Code
# design matrix
design_vars = np.array(['pct_pubtransit','med_hh_inc', 'traffic', 'frag_index'])
design_mat = bronx_gdf[design_vars].to_numpy()
design_mat[:, 1] = np.log(design_mat[:, 1])
design_mat[:, 2] = np.log(design_mat[:, 2])

# neighbors array
bronx_nbs_adj = nbs_to_adjlist(bronx_nbs)

# scaling factor
tau = get_scaling_factor(bronx_nbs)

bronx_data = {"N":bronx_gdf.shape[0],
              "y":bronx_gdf['count'].astype('int'),
              "E":bronx_gdf['kid_pop'].astype('int'),
              "K":4,
              "xs":design_mat,
              "N_edges": bronx_nbs_adj.shape[1],
              "neighbors": bronx_nbs_adj,
              "tau": tau
}
Show Code
bronx_ozs_fit = bym2_ozs_mod.sample(data=bronx_data, iter_warmup=5000)
12:52:34 - cmdstanpy - INFO - CmdStan start processing
12:52:42 - cmdstanpy - INFO - CmdStan done processing.
Show Code
a_seed = bronx_ozs_fit.metadata.cmdstan_config['seed']
Show Code
bronx_soft_fit = bym2_soft_mod.sample(data=bronx_data, iter_warmup=5000, seed=a_seed)
12:52:42 - cmdstanpy - INFO - CmdStan start processing
12:53:05 - cmdstanpy - INFO - CmdStan done processing.
Show Code
bronx_hard_fit = bym2_hard_mod.sample(data=bronx_data, iter_warmup=5000, seed=a_seed)
12:53:05 - cmdstanpy - INFO - CmdStan start processing
12:53:30 - cmdstanpy - INFO - CmdStan done processing.
12:53:30 - cmdstanpy - WARNING - Some chains may have failed to converge.
    Chain 4 had 1 divergent transitions (0.1%)
    Use the "diagnose()" method on the CmdStanMCMC object to see further information.
Show Code
bronx_ozs_summary = bronx_ozs_fit.summary()
bronx_ozs_summary.index = bronx_ozs_summary.index.astype(str) + "  a) ozs"
ozs_fit_time = bronx_ozs_fit.time
ozs_total_time = sum(t['total'] for t in ozs_fit_time)
bronx_ozs_summary['ESS/sec'] = bronx_ozs_summary['ESS_bulk']/ozs_total_time


bronx_hard_summary = bronx_hard_fit.summary()
bronx_hard_summary.index = bronx_hard_summary.index.astype(str) + "  b) hard"
hard_fit_time = bronx_hard_fit.time
hard_total_time = sum(t['total'] for t in hard_fit_time)
bronx_hard_summary['ESS/sec'] = bronx_hard_summary['ESS_bulk']/hard_total_time

bronx_soft_summary = bronx_soft_fit.summary()
bronx_soft_summary.index = bronx_soft_summary.index.astype(str) + "  c) soft"
soft_fit_time = bronx_soft_fit.time
soft_total_time = sum(t['total'] for t in soft_fit_time)
bronx_soft_summary['ESS/sec'] = bronx_soft_summary['ESS_bulk']/soft_total_time

bronx_fits_summary = pd.concat([bronx_ozs_summary, bronx_hard_summary, bronx_soft_summary])
Show Code
beta_summary = summarize_predictor(bronx_fits_summary, 'beta')
sigma_summary = summarize_predictor(bronx_fits_summary, 'sigma')
rho_summary = summarize_predictor(bronx_fits_summary, 'rho')

bronx_summary = pd.concat([beta_summary, sigma_summary, rho_summary])

display(HTML(style_dataframe(bronx_summary, 3).to_html()))
  Mean StdDev ESS_bulk ESS/sec R_hat
beta0 a) ozs -4.45 0.04 1598.47 49.24 1.01
beta0 b) hard -4.45 0.04 1613.18 16.58 1.00
beta0 c) soft -4.45 0.04 1672.65 18.84 1.00
beta_intercept a) ozs -4.74 1.93 777.59 23.95 1.01
beta_intercept b) hard -4.67 1.94 515.79 5.30 1.01
beta_intercept c) soft -4.74 1.92 655.13 7.38 1.01
betas[1] a) ozs 0.76 0.48 1055.64 32.52 1.01
betas[1] b) hard 0.73 0.47 644.12 6.62 1.01
betas[1] c) soft 0.72 0.46 780.16 8.79 1.00
betas[2] a) ozs -0.10 0.16 549.40 16.92 1.01
betas[2] b) hard -0.10 0.16 422.79 4.35 1.01
betas[2] c) soft -0.09 0.16 514.17 5.79 1.01
betas[3] a) ozs 0.07 0.05 1521.94 46.89 1.00
betas[3] b) hard 0.07 0.05 1491.25 15.33 1.00
betas[3] c) soft 0.07 0.05 1396.18 15.72 1.00
betas[4] a) ozs 0.04 0.03 1763.22 54.32 1.00
betas[4] b) hard 0.04 0.03 1230.65 12.65 1.00
betas[4] c) soft 0.04 0.03 1305.71 14.71 1.00
sigma a) ozs 0.64 0.04 411.13 12.67 1.01
sigma b) hard 0.64 0.04 355.42 3.65 1.02
sigma c) soft 0.64 0.04 731.19 8.23 1.00
rho a) ozs 0.20 0.11 160.42 4.94 1.02
rho b) hard 0.20 0.11 127.10 1.31 1.03
rho c) soft 0.20 0.11 225.15 2.54 1.03

Discussion

All three implementations return almost identical estimates, and the sum_to_zero_vector implementation consistently has the fastest running time. The marginal variances of the spatial component phi are roughly the same across all models, presumably because the ICAR prior is properly constraining the variances.

Model 2: The BYM2_multicomp model, Freni-Sterrantino et al, 2018

In the previous section, we analyzed the New York City data component-wise. This is highly unsatisfactory. To apply the BYM2 model to the full NYC dataset, we must extend the BYM2 model to account for disconnected components and singleton nodes.

This was done for INLA by Freni-Sterrantino et al. in 2018, in A note on intrinsic Conditional Autoregressive models for disconnected graphs. They provide the following recommendations:

  • Non-singleton nodes are given the BYM2 prior
  • Singleton nodes (islands) are given a standard normal prior
  • A scaling factor is computed per connected component
  • A sum-to-zero constraint is imposed on each connected component
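The per-component scaling factor in the third recommendation can be sketched in NumPy: following the BYM2 scaling of Riebler et al., it is the geometric mean of the ICAR prior's marginal variances, computed from the pseudo-inverse of the graph Laplacian. This is an illustrative sketch on a toy graph, not the case study's get_scaling_factor helper:

```python
import numpy as np

def scaling_factor(adj: np.ndarray) -> float:
    """Geometric mean of the ICAR prior's marginal variances for one
    connected component, given a symmetric 0/1 adjacency matrix."""
    Q = np.diag(adj.sum(axis=1)) - adj   # ICAR precision: degree minus adjacency
    Q_inv = np.linalg.pinv(Q)            # generalized inverse (Q is singular)
    return float(np.exp(np.mean(np.log(np.diag(Q_inv)))))

# toy component: a 4-node path graph 1-2-3-4
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
tau = scaling_factor(adj)
```

Dividing phi by the square root of this factor puts the spatial effects on roughly unit scale, which is what makes the mixing parameter rho interpretable.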

We have followed these recommendations and implemented this model in Stan. The full model is in the file stan/bym2_multicomp.stan, and the full implementation details can be found in the notebook The BYM2_multicomp model in Stan.

For this case study, we provide 2 implementations of the BYM2_multicomp model: one which uses the sum_to_zero_vector and one which implements the soft sum-to-zero constraint.

It is necessary to constrain the elements of the spatial effects vector phi on a component-by-component basis. Stan's slicing with range indexes provides a way to efficiently access each component. The helper function nyc_sort_by_comp_size both sorts the study data by component and adds the component index to the geodataframe.
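The per-component start/end indices used for this slicing can be derived directly from the component sizes. A hypothetical Python sketch (the helper name component_indexes is illustrative, not from the case study's utilities):

```python
def component_indexes(sizes: list[int]) -> list[list[int]]:
    """1-based, inclusive [start, end] index pairs for each component,
    matching Stan's range indexing phi[start : end]."""
    idxs, start = [], 1
    for s in sizes:
        idxs.append([start, start + s - 1])
        start += s
    return idxs

component_indexes([3, 2, 4])  # [[1, 3], [4, 5], [6, 9]]
```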

In the BYM2 model for a fully connected graph the sum-to-zero constraint on phi is implemented directly by declaring phi to be a sum_to_zero_vector, which is a constrained parameter type. The declaration:

  sum_to_zero_vector[N] phi;  // spatial effects

creates a constrained variable of length \(N\), with a corresponding unconstrained variable of length \(N-1\).

In order to constrain slices of the parameter vector phi, we do the following:

  • In the parameters block, we declare the unconstrained parameter phi_raw as a regular vector (instead of a sum_to_zero_vector).
    • For a fully connected graph of size \(N\), the size of the unconstrained sum-to-zero vector is \(N-1\). For a disconnected graph with \(M\) non-singleton nodes, the size of phi_raw is \(M\) minus the number of connected components.
  vector[N_connected - N_components] phi_raw;  // spatial effects
  • In the functions block, we define the constraining transform.

  • In the transformed parameters block, we apply the constraining transform.

  vector[N_connected] phi = zero_sum_components_lp(phi_raw, component_idxs, component_sizes);
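As a quick arithmetic check of this sizing rule, each component of size s contributes s constrained values but only s - 1 unconstrained ones, so the unconstrained vector drops one degree of freedom per component (the component sizes below are hypothetical):

```python
component_sizes = [1360, 329, 143]                    # hypothetical non-singleton components
n_connected = sum(component_sizes)                    # M, total non-singleton nodes
n_unconstrained = n_connected - len(component_sizes)  # M - N_components
```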

The constraining transform is broken into two functions:

  • function zero_sum_constrain_lp, the actual constraining transform, which corresponds directly to the built-in sum_to_zero_vector transform.

  • function zero_sum_components_lp, which handles the slicing and calls zero_sum_constrain_lp on each component.

  /**
   * Constrain sum-to-zero vector
   *
   * @param y unconstrained zero-sum parameters
   * @return vector z, one element longer than y, which sums to zero
   */
  vector zero_sum_constrain_lp(vector y) {
    int N = num_elements(y);
    vector[N + 1] z = zeros_vector(N + 1);
    real sum_w = 0;
    for (ii in 1:N) {
      int i = N - ii + 1; 
      real n = i;
      real w = y[i] * inv_sqrt(n * (n + 1));
      sum_w += w;
      z[i] += sum_w;     
      z[i + 1] -= w * n;    
    }
    return z;
  }
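As a sanity check, the same transform can be ported to Python. This sketch is not part of the case study code; it shows that N unconstrained values map to N + 1 values whose sum is (numerically) zero:

```python
import numpy as np

def zero_sum_constrain(y: np.ndarray) -> np.ndarray:
    """Python port of the Stan function zero_sum_constrain_lp
    (the log-Jacobian term is omitted; the transform is linear)."""
    N = len(y)
    z = np.zeros(N + 1)
    sum_w = 0.0
    for i in range(N, 0, -1):          # Stan's 1-based loop, descending
        n = float(i)
        w = y[i - 1] / np.sqrt(n * (n + 1))
        sum_w += w
        z[i - 1] += sum_w
        z[i] -= w * n
    return z

z = zero_sum_constrain(np.array([0.3, -1.2, 0.8]))
# z has length 4 and sums to (numerically) zero
```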
  • zero_sum_components_lp: slices vector phi by component and applies the constraining transform to each slice.
  /**
   * Component-wise constrain sum-to-zero vectors
   *
   * @param phi unconstrained vector of zero-sum slices
   * @param idxs component start and end indices
   * @param sizes component sizes
   * @return vector phi_ozs, the vector whose slices sum to zero
   */
  vector zero_sum_components_lp(vector phi,
                                array[ , ] int idxs,
                                array[] int sizes) {
    vector[sum(sizes)] phi_ozs;
    int idx_phi = 1;
    int idx_ozs = 1;
    for (i in 1:size(sizes)) {
      phi_ozs[idx_ozs : idx_ozs + sizes[i] - 1] =
        zero_sum_constrain_lp(segment(phi, idx_phi, sizes[i] - 1));
      idx_phi += sizes[i] - 1;
      idx_ozs += sizes[i];
    }
    return phi_ozs;
  }
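The component-wise bookkeeping can be checked the same way. A hypothetical Python sketch (not the case study code): each component of size s consumes s - 1 unconstrained values and yields s values summing to zero:

```python
import numpy as np

def zero_sum_constrain(y: np.ndarray) -> np.ndarray:
    """Python port of the Stan function zero_sum_constrain_lp."""
    N = len(y)
    z = np.zeros(N + 1)
    sum_w = 0.0
    for i in range(N, 0, -1):
        n = float(i)
        w = y[i - 1] / np.sqrt(n * (n + 1))
        sum_w += w
        z[i - 1] += sum_w
        z[i] -= w * n
    return z

def zero_sum_components(phi_raw: np.ndarray, sizes: list[int]) -> np.ndarray:
    """Apply the zero-sum transform slice by slice, mirroring
    the Stan function zero_sum_components_lp."""
    out, idx = [], 0
    for s in sizes:
        out.append(zero_sum_constrain(phi_raw[idx : idx + s - 1]))
        idx += s - 1
    return np.concatenate(out)

sizes = [4, 3]                                     # hypothetical component sizes
phi_raw = np.random.default_rng(0).normal(size=sum(sizes) - len(sizes))
phi = zero_sum_components(phi_raw, sizes)
# phi has length 7; phi[0:4] and phi[4:7] each sum to (numerically) zero
```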

Data Assembly

The helper function nyc_sort_by_comp_size adds component info to the geodataframe. It also returns the neighbor graph over the full dataset, plus a list of component sizes.

Show Code
from utils_nyc_map import nyc_sort_by_comp_size
from utils_bym2 import nbs_to_adjlist, get_scaling_factors

(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)

# design matrix
design_vars = np.array(['pct_pubtransit','med_hh_inc', 'traffic', 'frag_index'])
design_mat = nyc_gdf[design_vars].to_numpy()
design_mat[:, 1] = np.log(design_mat[:, 1])
design_mat[:, 2] = np.log(design_mat[:, 2])

# spatial structure
nyc_nbs_adj = nbs_to_adjlist(nyc_nbs)
component_sizes = [x for x in nyc_comp_sizes if x > 1]
scaling_factors = get_scaling_factors(len(component_sizes), nyc_gdf)

We assemble all inputs into dictionary bym2_multicomp_data.

Show Code
bym2_multicomp_data = {
    "N":nyc_gdf.shape[0],
    "y":nyc_gdf['count'].astype('int'),
    "E":nyc_gdf['kid_pop'].astype('int'),
    "K":4,
    "xs":design_mat,
    "N_edges": nyc_nbs_adj.shape[1],
    "neighbors": nyc_nbs_adj,
    "N_components": len(component_sizes),
    "component_sizes": component_sizes,
    "scaling_factors": scaling_factors
}

Model Fitting

Show Code
bym2_multicomp_ozs_file = os.path.join('stan', 'bym2_multicomp.stan')
bym2_multicomp_ozs_mod = CmdStanModel(stan_file=bym2_multicomp_ozs_file)
Show Code
bym2_multicomp_ozs_fit = bym2_multicomp_ozs_mod.sample(data=bym2_multicomp_data, iter_warmup=3000)
12:53:41 - cmdstanpy - INFO - CmdStan start processing
12:54:59 - cmdstanpy - INFO - CmdStan done processing.
Show Code
bym2_multicomp_ozs_summary = bym2_multicomp_ozs_fit.summary()
ozs_fit_time = bym2_multicomp_ozs_fit.time
ozs_total_time = sum(t['total'] for t in ozs_fit_time)
bym2_multicomp_ozs_summary['ESS/sec'] = bym2_multicomp_ozs_summary['ESS_bulk']/ozs_total_time

bym2_multicomp_ozs_summary[['Mean', 'StdDev', 'ESS_bulk', 'ESS/sec', 'R_hat']].round(2).loc[
  ['beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'sigma', 'rho']]
Mean StdDev ESS_bulk ESS/sec R_hat
beta_intercept -5.81 0.58 663.69 2.21 1.01
beta0 -4.48 0.02 1420.89 4.73 1.00
betas[1] 0.82 0.16 498.59 1.66 1.02
betas[2] 0.04 0.05 614.40 2.04 1.01
betas[3] 0.03 0.02 1008.64 3.36 1.00
betas[4] 0.04 0.01 803.43 2.67 1.01
sigma 0.78 0.03 137.69 0.46 1.01
rho 0.39 0.06 85.86 0.29 1.02

Model Comparison

We can compare the sum_to_zero_vector implementation to an implementation of the soft sum-to-zero constraint. In the bym2_soft.stan model, the soft sum-to-zero constraint is combined directly with the ICAR prior:

  target += (-0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]])
             + normal_lupdf(sum(phi) | 0, 0.001 * rows(phi)));

For the BYM2_multicomp model, we split this into two parts: the ICAR prior is applied to phi, and then we iterate through the components, applying the sum-to-zero constraint to each in turn.

  target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]);  // ICAR
  for (n in 1:N_components) {   // component-wise sum-to-zero constraint
    sum(phi[node_idxs[n, 1] : node_idxs[n, 2]]) ~ normal(0, 0.001 * component_sizes[n]);
  }
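To see what the soft constraint adds to the target, here is an illustrative NumPy computation of the component-wise penalty (the variable names mirror the Stan snippet; this is not part of the case study code, and the normalizing constant is dropped):

```python
import numpy as np

def soft_zero_sum_penalty(phi, node_idxs, component_sizes):
    """Log-density increment from giving each component's sum a
    normal(0, 0.001 * size) distribution, up to a constant."""
    lp = 0.0
    for (start, end), size in zip(node_idxs, component_sizes):
        s = phi[start - 1 : end].sum()   # Stan indices are 1-based, inclusive
        sigma = 0.001 * size
        lp += -0.5 * (s / sigma) ** 2
    return lp

phi = np.array([0.5, -0.25, -0.25, 1.0, -1.0])   # two components, each summing to zero
penalty = soft_zero_sum_penalty(phi, [[1, 3], [4, 5]], [3, 2])
# penalty is (numerically) zero because both component sums are zero
```

Because sigma is tiny, any slice whose sum drifts away from zero incurs a steep penalty, which is what makes this a "soft" version of the constraint.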

The data inputs are the same. To ensure (roughly) the same initialization, we reuse the seed from bym2_multicomp_ozs_fit.

Show Code
a_seed = bym2_multicomp_ozs_fit.metadata.cmdstan_config['seed']
Show Code
bym2_multicomp_soft_file = os.path.join('stan', 'bym2_multicomp_soft.stan')
bym2_multicomp_soft_mod = CmdStanModel(stan_file=bym2_multicomp_soft_file)

This model fits very slowly and requires increasing the max_treedepth; at the default setting, all iterations hit this limit.

Show Code
bym2_multicomp_soft_fit = bym2_multicomp_soft_mod.sample(data=bym2_multicomp_data, iter_warmup=3000, max_treedepth=14, seed=a_seed)
12:55:24 - cmdstanpy - INFO - CmdStan start processing
13:13:57 - cmdstanpy - INFO - CmdStan done processing.
Show Code
bym2_multicomp_soft_summary = bym2_multicomp_soft_fit.summary()

We compare the result of both implementations.

Show Code
bym2_multicomp_ozs_summary.index = bym2_multicomp_ozs_summary.index.astype(str) + "  a) ozs"

bym2_multicomp_soft_summary.index = bym2_multicomp_soft_summary.index.astype(str) + "  b) soft"
soft_fit_time = bym2_multicomp_soft_fit.time
soft_total_time = sum(t['total'] for t in soft_fit_time)
bym2_multicomp_soft_summary['ESS/sec'] = bym2_multicomp_soft_summary['ESS_bulk']/soft_total_time

bym2_multicomp_summary = pd.concat([bym2_multicomp_ozs_summary, bym2_multicomp_soft_summary])

beta_summary = summarize_predictor(bym2_multicomp_summary, 'beta')
sigma_summary = summarize_predictor(bym2_multicomp_summary, 'sigma')
rho_summary = summarize_predictor(bym2_multicomp_summary, 'rho')
nyc_summary = pd.concat([beta_summary, sigma_summary, rho_summary])
Show Code
display(HTML(style_dataframe(nyc_summary, 2).to_html()))
  Mean StdDev ESS_bulk ESS/sec R_hat
beta0 a) ozs -4.48 0.02 1420.89 4.73 1.00
beta0 b) soft -4.48 0.02 1415.98 0.33 1.00
beta_intercept a) ozs -5.81 0.58 663.69 2.21 1.01
beta_intercept b) soft -5.78 0.56 831.93 0.19 1.00
betas[1] a) ozs 0.82 0.16 498.59 1.66 1.02
betas[1] b) soft 0.81 0.16 645.44 0.15 1.01
betas[2] a) ozs 0.04 0.05 614.40 2.04 1.01
betas[2] b) soft 0.04 0.05 762.71 0.18 1.00
betas[3] a) ozs 0.03 0.02 1008.64 3.36 1.00
betas[3] b) soft 0.03 0.02 1250.82 0.29 1.00
betas[4] a) ozs 0.04 0.01 803.43 2.67 1.01
betas[4] b) soft 0.04 0.01 731.65 0.17 1.01
sigma a) ozs 0.78 0.03 137.69 0.46 1.01
sigma b) soft 0.78 0.02 242.67 0.06 1.01
rho a) ozs 0.39 0.06 85.86 0.29 1.02
rho b) soft 0.39 0.05 161.15 0.04 1.02

Discussion

The BYM2 model has more data and a relatively complex multilevel structure. Before Stan 2.36, for this model and dataset, the soft sum-to-zero constraint was much faster than the hard sum-to-zero constraint. Here we show that the sum_to_zero_vector greatly improves the run time.

For the BYM2_multicomp model, stan/bym2_multicomp.stan shows how to implement the sum_to_zero_vector constraining transform as a Stan function. A comparable implementation using the soft sum-to-zero constraint is painfully slow. Both implementations recover essentially the same estimates, which confirms that both models are correctly implemented. The dramatic difference in run times speaks for itself.

Conclusion: the sum_to_zero_vector just works!

The more complex the model, the greater the need for the sum_to_zero_vector. When considering effective sample size, what matters most is effective samples per second. In these experiments, the sum_to_zero_vector models consistently have the best wall-clock time and the highest ESS/sec.

References